Chapter 1 - Number crunching and matrix manipulation with Numpy

Numpy's typed array

Numpy ('Numerical Python') has become an indispensable package for scientific computing in Python, on which many other packages fundamentally rely: it takes care of number crunching at the most basic level. Numpy is conventionally imported under the name np:


In [ ]:
import numpy as np

Numpy is a library which is typically used for hardcore number crunching. It is much faster than standard Python at this task, because it comes with a great deal of highly optimized routines. However, to achieve this speed-up, numpy asks us to give up one advantage of Python: its typeless nature. Recall that Python is normally super-flexible when it comes to the types of the variables that we work with. We can easily create and reassign different sorts of data types to variables, switching between integers, floats, booleans and even strings:


In [ ]:
nb = 5
print(type(nb))
nb = 6.0
print(type(nb))
nb = 'number'
print(type(nb))
nb = False
print(type(nb))

This flexibility is commonly referred to as Python's 'typeless' nature: at least from the coder's perspective, Python does not seem to care about the kind of data we store inside variables. The same flexibility is offered by Python's collection-like data structures. Python's lists and tuples ('arrays'), for instance, allow us to freely mix different data types:


In [ ]:
# create a mixed list first:
arr = ['number', 5, 6.0]
print(type(arr))

# check the individual data types:
for a in arr:
    print('\t-', type(a))

# turn this into a tuple:
arr = tuple(arr)
print(type(arr))

# check the individual data types:
for a in arr:
    print('\t-', type(a))

This awesome flexibility is something Java coders can only dream of! Naturally, this flexibility comes at a high cost - and is, in fact, one of the main reasons why Python is considered a relatively 'slow' language. When executing code in Python, your machine can never be sure what to expect: a variable which was initially defined to hold a simple number can easily be reassigned to hold a book-long string! Thus, Python must constantly manage and anticipate these complex, memory-related requirements.

Numpy changes these requirements by defining a typed data object: the array object, which is in many ways the most central workhorse in the package. This object can be used to efficiently work with large arrays of numbers. The only concession numpy asks of us is that we store only one type of data in such an array. These values will typically be (floating-point) numbers. Let us start by creating a small list of floats in normal Python, and then convert it to a numpy array:


In [ ]:
nbs = [0.2, 4.5, 0.7, -0.9, 12.67]
a = np.array(nbs)
print(type(a))

Note how the representation of our value collection is slightly altered when printed (no commas):


In [ ]:
print(nbs)
print(a)

A numpy array has two fundamental properties: it has a data type (array.dtype) and a shape (array.shape):


In [ ]:
print(a.dtype)

The dtype reflects the data type of the numbers stored in the array. In this case, we are dealing with float64 numbers, which is apparently the default type used when creating an array from floats. This can be controlled via the dtype argument in the constructor. To create an array of simple integers, for instance, we could do:


In [ ]:
i = np.array(range(0, 5), dtype='int8')
print(i)
print(i.dtype)
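
An existing array can also be converted to a different data type after the fact: astype() returns a new array with the requested dtype (a small sketch; note that casting floats to integers truncates the decimals):


In [ ]:
f = np.array([0.2, 4.5, 0.7])
print(f.dtype)

# astype() returns a *new* array with the requested data type;
# casting floats to integers simply truncates the decimals
i2 = f.astype('int8')
print(i2)
print(i2.dtype)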

The shape property captures the structure of our number collection, which is closely related to the concept of 'dimensionality'. In this case, our array only has a single 'dimension', of size 5 - because our list contains 5 numbers:


In [ ]:
print(a.shape)

Note that we can pass both Python lists and (immutable) tuples to the array() constructor. Make sure, however, to explicitly specify a data type whenever numpy might otherwise guess the wrong one when casting. Compare the following situations:


In [ ]:
# mixed:
nbs = [1, 2, 3.0]
a = np.array(nbs)
print(a.dtype)

# just ints:
nbs = [1, 2, 3]
a = np.array(nbs)
print(a.dtype)

# explicit cast:
nbs = [1.0, 2.0, 3.0]
a = np.array(nbs, dtype='int8')
print(a.dtype)
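
The reward for giving up this flexibility is speed. As a rough illustration (the exact numbers will vary from machine to machine), we might time the summing of a large Python list against summing the equivalent numpy array:


In [ ]:
import time

py_list = list(range(1000000))   # a plain Python list...
np_arr = np.array(py_list)       # ...and its numpy counterpart

start = time.perf_counter()
sum(py_list)
print('pure Python:', time.perf_counter() - start)

start = time.perf_counter()
np_arr.sum()
print('numpy:', time.perf_counter() - start)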

Matrices

A single list or array of numbers is commonly referred to as a 'vector' in computing. Conventionally, vectors take lowercase (single-character) names, such as a or i above. When working with a 'list of lists', it is more common to speak about these nested collections of numbers as 'matrices' (singular: 'matrix'), which typically take uppercase names. Here is an example of such a list of lists:


In [ ]:
lists = [[1, 3, 4,],
         [2, 3, 4],
         [8, 9, 9]]
M = np.array(lists)
print(M)

This matrix has two dimensions, the shape of which can be inspected via the tuple returned by array.shape:


In [ ]:
print(M.shape)

This property shows that our two-dimensional matrix has three 'rows' and three 'columns'. In fact, a matrix is a very common data structure in stylometry, because we typically represent a corpus of texts as a spreadsheet-like table, where rows represent documents, and columns represent words. The cells in the corpus matrix could for instance hold the relative frequencies of each word in each text.
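
To make this concrete, a toy corpus matrix could look as follows; the word columns and relative frequencies below are invented purely for the sake of illustration:


In [ ]:
# rows: 3 documents, columns: 4 words ('the', 'of', 'and', 'to')
# the relative frequencies below are invented for the sake of the example
corpus = np.array([[0.061, 0.032, 0.028, 0.025],
                   [0.058, 0.029, 0.031, 0.027],
                   [0.064, 0.035, 0.026, 0.022]])
print(corpus.shape)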

Initializing matrices with values is easy in numpy. To 'prepopulate' a matrix with certain values, we could use the following line, where a tuple indicates the desired shape of the matrix:


In [ ]:
M2 = np.zeros((4, 6))
print(M2)
print(M2.shape)

For a matrix of ones, we have the equivalent:


In [ ]:
M_ones = np.ones((8, 4))
print(M_ones)
print(M_ones.shape)

To fill such a structure with a different value, we can use fill():


In [ ]:
M.fill(1.0)
print(M)
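
If we know the desired constant from the start, np.full() offers a convenient shortcut which creates and fills the matrix in a single step:


In [ ]:
M_sevens = np.full((3, 3), 7.0)  # the shape, then the constant to fill with
print(M_sevens)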

Arrays aren't limited to two dimensions: just as in standard Python we can easily create a list of lists of lists, we can pass a deeper nested list structure to array(), or specify the desired shape directly as follows:


In [ ]:
M3 = np.zeros((3, 4, 2))
print(M3)

Such deeper matrices are often called 'tensors' nowadays, but they are less common in stylometry. Just bear in mind that dtype and shape are the main properties to keep track of when dealing with numpy arrays. The exact number of dimensions can also be checked using the ndim property:


In [ ]:
print(a.ndim)  # vector
print(M2.ndim) # matrix
print(M3.ndim) # tensor

Indexing

Learning how to handle and manipulate numpy arrays is crucial nowadays. Just as with normal collections in Python, such as tuples, we need efficient ways to access parts of our arrays, such as a single row or column. Numpy offers good support for such array-based indexing, using a mechanism that is very similar to what you might know from R or Matlab. Let us create a 2D matrix with random values between 0 and 1, and practice our indexing skills:


In [ ]:
X = np.random.uniform(size=(5, 4), low=0, high=1)
print(X)

What we intuitively think of as 'rows' is always the first (and in some ways principal) dimension in numpy. If we use standard indexing, we therefore index the matrix at the row level. Naturally, indexing is zero-based, so we start counting at 0; negative indices such as -1 count from the end:


In [ ]:
print(X[0])
print(X[3])
print(X[-1])

Specifying ranges of indices - to obtain slices of data, instead of individual data points - is also easy:


In [ ]:
print(X[2:5])

Of course, the usual shortcuts apply:


In [ ]:
print(X[:3]) # all rows up to (and not including) row 3
print(X[-2:]) # the last two rows (from the second-to-last row onwards)

A very useful feature is that the indices passed can be discontinuous, which will prove amazingly useful when we have to select specific parts of our data:


In [ ]:
rngs = [0, 2, 4]
print(X[rngs])

As you can see, only the relevant rows are included in our indexed selection. Because numpy arrays can have multiple dimensions, we can index our data in the other dimensions too. To index entire columns, we can use the following syntax:


In [ ]:
print(X[:, 1])

Using the colon syntax (:), we in fact indicate that we want to index the second column across all rows. If we only wanted to extract this column for the first two rows, we could have done:


In [ ]:
print(X[:2, 1])

Here, we obtain a flat vector, instead of a vertical column, so to speak. Of course, when we start playing with more advanced slicing strategies, we can obtain matrices too:


In [ ]:
print(X[:2, 1:3])

Here, we obtain a matrix which represents a 2D slice of our original matrix, namely the cells for the first two rows and the second and third columns.

Note that with columns too, the indices can be discontinuous, which is useful if we want to make a subselection of specific columns:


In [ ]:
rngs = [1, 3]
print(X[:, rngs])
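
Note that discontinuous row and column indices cannot simply be combined as X[rows, cols]: that syntax pairs the indices up element-wise (and requires lists of equal length). To select every combination of a set of rows and a set of columns, we can wrap the indices in np.ix_() (a brief sketch):


In [ ]:
rows = [0, 2, 4]
cols = [1, 3]
# np.ix_ builds an 'open mesh', so that we obtain every combination
# of the selected rows and columns as a submatrix:
print(X[np.ix_(rows, cols)])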

As an aside, we should note that other indexing mechanisms are available too, although they can be slower. Consider this example:


In [ ]:
print(X[0][2])

This is slower, because we first select the entire first row as an intermediate array, and only then index that intermediate result for its third element: two indexing operations instead of one (X[0, 2] achieves the same in a single step). Finally, we should note that we can also use a 'stepping' argument when indexing, which you will probably not need very often. To select every other row from a matrix, we can use the following syntax, where our indexer skips a beat each time because we specify a step of 2:


In [ ]:
print(X[0::2]) # rows at even indices (0, 2, 4, ...)
print(X[1::2]) # rows at odd indices (1, 3, 5, ...)
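
The step can even be negative, which reverses the order; to list the rows of X from last to first, for instance:


In [ ]:
print(X[::-1])  # all rows, in reverse order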

Of course, indexing can also be used to reset a specific part of a matrix. This works for individual elements:


In [ ]:
r = np.zeros((3, 3))
print(r)
r[1, 2] = 3  # single-step indexing; r[1][2] = 3 would work too, but is slower
print(r)

but also entire rows and columns:


In [ ]:
g = np.ones((3,))
r[1] = g
print(r)
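
Entire columns can be reset in the same way, using the colon syntax from above to address a column across all rows (a short sketch):


In [ ]:
r[:, 0] = 7  # set the entire first column (across all rows) to 7
print(r)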

Note that the shapes have to be compatible in the latter case, of course. It is impossible to turn a vector with 4 items into a row of a matrix which only has 3 columns:


In [ ]:
f = np.ones((4,))
r[1] = f

(We will say more about 'broadcasting' in the next section.) One final, yet important remark in the context of assignment concerns the following warning: when slicing numpy arrays, we obtain a 'view' of the original array, and not a completely new object. This is efficient in many cases, but it can also lead to nasty situations with unintended effects:


In [ ]:
# we create a new matrix of zeros:
a = np.zeros((2, 2))
print(a)

# we select the first row and assign it to b:
b = a[0]

# we change b...
b[1] = 3

# and apparently, we also changed a!
print(a)

Please keep this behaviour in mind! Making an explicit, full copy might make more sense here, which can be achieved using copy():


In [ ]:
a = np.zeros((2, 2))
b = a[0].copy()
b[1] = 3
print(a)
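
If you are ever unsure whether two arrays share the same underlying data, np.shares_memory() offers a quick check:


In [ ]:
a = np.zeros((2, 2))
view = a[0]        # a slice: shares memory with a
cop = a[0].copy()  # an independent copy
print(np.shares_memory(a, view))
print(np.shares_memory(a, cop))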

Manipulating matrices

Once we have created matrices, it becomes interesting to actually start working with them. Matrices can be manipulated in many ways. One common operation is transposing a matrix, a simple action which 'flips' the matrix so that rows become columns and vice versa:


In [ ]:
M_orig = np.random.uniform(size=(3, 5))
print(M_orig)
M_trans = M_orig.transpose()
print(M_trans)
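
As a shorthand, the same transposed result is also available through the T attribute:


In [ ]:
print(M_orig.T)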

To check whether everything worked well, we can do:


In [ ]:
M_trans2 = M_trans.transpose()
print(M_trans2 == M_orig)

The final line in the previous code block raises an interesting issue: the equality operator (==) does not simply check whether the two matrices are identical as a whole. Rather, it zips together the corresponding cells in both matrices and checks whether these are equal individually, i.e. in an 'element-wise' fashion. To check whether the matrices are fully equal, we should use:


In [ ]:
print(np.array_equal(M_trans2, M_orig))
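
Because our matrices hold floating-point numbers, tiny rounding differences can make such a strict equality check fail; np.allclose(), which compares two arrays up to a small numerical tolerance, is therefore often the safer choice:


In [ ]:
print(np.allclose(M_trans2, M_orig))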

Combining existing matrices into a new matrix is also a common operation. This can be achieved via, for instance, the lines:


In [ ]:
M1 = np.zeros((2, 3))
print(M1)
M2 = np.ones((2, 3))
print(M2)
M3 = np.zeros((2, 3))
print(M3)
X = np.concatenate((M1, M2, M3))
print(X)

Here, we pass a tuple of matrices to concatenate(). Note that, by default, numpy will attempt to stack the matrices at the level of rows (i.e. 'vertically'). Via the axis argument (which is 0 by default), we can control along which dimension the input matrices are concatenated. In the following example, we join the matrices side by side, as columns (i.e. 'horizontally'):


In [ ]:
X = np.concatenate((M1, M2, M3), axis=1)
print(X)

For such stacking operations, there exist two equivalent methods, namely hstack() for horizontal stacking and vstack() for vertical stacking:


In [ ]:
X = np.vstack((M1, M2, M3)) # equivalent to np.concatenate(..., axis=0)
print(X)
X = np.hstack((M1, M2, M3)) # equivalent to np.concatenate(..., axis=1)
print(X)

Here too, we can only combine matrices which have compatible shapes:


In [ ]:
M1 = np.zeros((3, 4))
M2 = np.ones((4, 4))
X = np.concatenate((M1, M2), axis=0) # works: both matrices have 4 columns
#X = np.concatenate((M1, M2), axis=1) # fails: M1 has 3 rows, M2 has 4
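
When the shapes are incompatible, numpy raises a ValueError; with the matrices from the cell above, we can make this explicit:


In [ ]:
try:
    X = np.concatenate((M1, M2), axis=1)
except ValueError as e:
    print('concatenation failed:', e)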

Exercise

  • Define a 1000 by 300 matrix of zeroes. Fill the last 30 columns with random values between 3 and 5. Now calculate the average of all cells on the main diagonal of the matrix (i.e. at the indices [0][0], [1][1], [2][2], ...).
  • Automatically create a numpy array that looks like this:

In [ ]:
[[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.]
 [  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.]
 [  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.]
 [  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.]
 [  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.]
 [  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.]
 [  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.]
 [  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.]
 [  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.]
 [  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.]
 [  0.  10.  20.  30.  40.  50.  60.  70.  80.  90.]]